In [1]:
%pylab inline
import pylab as pl
import numpy as np
# Some nice default configuration for plots
pl.rcParams['figure.figsize'] = 10, 7.5
pl.rcParams['axes.grid'] = True
pl.gray()
Let's start by implementing a canonical text classification example:
In [2]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
# Load the text data
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
twenty_train_small = load_files('../datasets/20news-bydate-train/',
                                categories=categories, charset='latin-1')
twenty_test_small = load_files('../datasets/20news-bydate-test/',
                               categories=categories, charset='latin-1')

# Turn the text documents into vectors of word frequencies
vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(twenty_train_small.data)
y_train = twenty_train_small.target

# Fit a classifier on the training set
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))

# Evaluate the classifier on the testing set
X_test = vectorizer.transform(twenty_test_small.data)
y_test = twenty_test_small.target
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))
To summarize the workflow so far: load the text data, turn it into TF-IDF feature vectors, fit a classifier on the training set, and evaluate it on the held-out test set.
Let's now decompose what we just did to understand and customize each step.
Let's explore the dataset loading utility without passing a list of categories: in this case we load the full 20 newsgroups dataset in memory. The source website for the 20 newsgroups already provides a date-based train / test split; here it is materialized as the two folders listed below, while the built-in downloader exposes the same split through its subset keyword argument (see the side note that follows):
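As a side note, here is a minimal sketch of that downloader usage (it fetches the data over the network on first call and is not used in the rest of this notebook):
In [ ]:
from sklearn.datasets import fetch_20newsgroups

# subset can be 'train', 'test' or 'all'
twenty_train_full = fetch_20newsgroups(subset='train')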
In [3]:
ls -l ../datasets/
In [4]:
ls -lh ../datasets/20news-bydate-train
In [5]:
ls -lh ../datasets/20news-bydate-train/alt.atheism/
The load_files function can load text files from a two-level folder structure, assuming the folder names represent the categories:
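The expected layout is a container folder holding one sub-folder per category, with one text file per document, for instance (the file names here are purely illustrative):

    container_folder/
        category_1/
            doc_1.txt
            doc_2.txt
        category_2/
            doc_3.txt
            ...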
In [6]:
#print(load_files.__doc__)
In [7]:
all_twenty_train = load_files('../datasets/20news-bydate-train/',
                              charset='latin-1', random_state=42)
all_twenty_test = load_files('../datasets/20news-bydate-test/',
                             charset='latin-1', random_state=42)
In [8]:
all_target_names = all_twenty_train.target_names
all_target_names
Out[8]:
In [9]:
all_twenty_train.target
Out[9]:
In [10]:
all_twenty_train.target.shape
Out[10]:
In [11]:
all_twenty_test.target.shape
Out[11]:
In [12]:
len(all_twenty_train.data)
Out[12]:
In [13]:
type(all_twenty_train.data[0])
Out[13]:
In [14]:
def display_sample(i, dataset):
    print("Class name: " + dataset.target_names[dataset.target[i]])
    print("Text content:\n")
    print(dataset.data[i])
In [15]:
display_sample(0, all_twenty_train)
In [16]:
display_sample(1, all_twenty_train)
Let's compute the (uncompressed, in-memory) size of the training and test sets in MB, assuming an 8 bit (1 byte) per-character encoding (in this case, all chars can be encoded using the latin-1 charset):
In [17]:
def text_size(text, charset='iso-8859-1'):
    # len() of the encoded byte string is the size in bytes (latin-1 uses
    # 1 byte per char); multiply by 1e-6 to convert to MB
    return len(text.encode(charset)) * 1e-6

train_size_mb = sum(text_size(text) for text in all_twenty_train.data)
test_size_mb = sum(text_size(text) for text in all_twenty_test.data)

print("Training set size: {0} MB".format(int(train_size_mb)))
print("Testing set size: {0} MB".format(int(test_size_mb)))
The corresponding sizes for the small subset of 4 categories selected in the initial example:
In [18]:
train_small_size_mb = sum(text_size(text) for text in twenty_train_small.data)
test_small_size_mb = sum(text_size(text) for text in twenty_test_small.data)
print("Training set size: {0} MB".format(int(train_small_size_mb)))
print("Testing set size: {0} MB".format(int(test_small_size_mb)))
In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVectorizer()
Out[19]:
In [20]:
vectorizer = TfidfVectorizer(min_df=1)
%time X_train_small = vectorizer.fit_transform(twenty_train_small.data)
The result is not a numpy.array but a scipy.sparse matrix. This data structure is quite similar to a 2D numpy array, but it does not store the zeros.
In [21]:
X_train_small
Out[21]:
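To quantify just how sparse it is, we can compute the fraction of stored (non-zero) entries explicitly; this quick check is not in the original notebook:
In [ ]:
# Only the non-zero entries are actually stored in memory
density = X_train_small.nnz / float(X_train_small.shape[0] * X_train_small.shape[1])
print("Density: {0:.4f}".format(density))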
scipy.sparse matrices also have a shape attribute to access the dimensions:
In [22]:
n_samples, n_features = X_train_small.shape
This dataset has around 2000 samples (the rows of the data matrix):
In [23]:
n_samples
Out[23]:
This is the same value as the number of strings in the original list of text documents:
In [24]:
len(twenty_train_small.data)
Out[24]:
The columns represent the individual token occurrences:
In [25]:
n_features
Out[25]:
This number is the size of the vocabulary of the model, extracted during fit and stored as a Python dictionary:
In [26]:
type(vectorizer.vocabulary_)
Out[26]:
In [27]:
len(vectorizer.vocabulary_)
Out[27]:
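The dictionary maps each token string to its column index in the extracted matrix, for instance (assuming the token 'space' occurs in this corpus, which is likely given the sci.space category):
In [ ]:
# Column index assigned to the token 'space'
vectorizer.vocabulary_['space']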
The keys of the vocabulary_
attribute are also called feature names and can be accessed as a list of strings.
In [28]:
len(vectorizer.get_feature_names())
Out[28]:
Here are the first 10 elements (sorted in lexicographical order):
In [29]:
vectorizer.get_feature_names()[:10]
Out[29]:
Let's have a look at the features from the middle:
In [30]:
vectorizer.get_feature_names()[n_features // 2:n_features // 2 + 10]
Out[30]:
Now that we have extracted a vector representation of the data, it's a good idea to project the data onto the first two dimensions of a Principal Component Analysis to get a feel for it. Note that the RandomizedPCA class can accept scipy.sparse matrices as input (as an alternative to numpy arrays):
In [31]:
from sklearn.decomposition import RandomizedPCA
%time X_train_small_pca = RandomizedPCA(n_components=2).fit_transform(X_train_small)
In [33]:
from itertools import cycle
colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
for i, c in zip(np.unique(y_train), cycle(colors)):
    pl.scatter(X_train_small_pca[y_train == i, 0],
               X_train_small_pca[y_train == i, 1],
               c=c, label=twenty_train_small.target_names[i], alpha=0.5)
#_ = pl.legend(loc='best')
We can observe that there is a large overlap between the samples from different categories. This is to be expected, as the PCA linear projection maps data from a 34118-dimensional space down to 2 dimensions: data that is linearly separable in 34118D is often no longer linearly separable in 2D.
Still we can notice an interesting pattern: the newsgroups on religion and atheism occupy much the same region, while the computer graphics and space science newsgroups overlap more with each other than they do with the religion or atheism newsgroups.
We have previously extracted a vector representation of the training corpus and put it into a variable named X_train_small. To train a supervised model, in this case a classifier, we also need the corresponding target labels:
In [34]:
y_train_small = twenty_train_small.target
In [35]:
y_train_small.shape
Out[35]:
In [36]:
y_train_small
Out[36]:
We can check that we have the same number of samples for the input data and the labels:
In [37]:
X_train_small.shape[0] == y_train_small.shape[0]
Out[37]:
We can now train a classifier, for instance a multinomial naive Bayes classifier:
In [38]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha=0.1)
clf
Out[38]:
In [39]:
clf.fit(X_train_small, y_train_small)
Out[39]:
We can now evaluate the classifier on the testing set. Let's first use the built-in score method, which computes the rate of correct classification on the test set:
In [40]:
X_test_small = vectorizer.transform(twenty_test_small.data)
y_test_small = twenty_test_small.target
In [41]:
X_test_small.shape
Out[41]:
In [42]:
y_test_small.shape
Out[42]:
In [43]:
clf.score(X_test_small, y_test_small)
Out[43]:
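This score is simply the fraction of correctly predicted labels; the following equivalent computation (a sanity check, not in the original notebook) makes that explicit:
In [ ]:
# Accuracy by hand: fraction of test documents whose label is predicted exactly
predicted_small = clf.predict(X_test_small)
print(np.mean(predicted_small == y_test_small))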
We can also compute the score on the training set and observe that the model is both overfitting and underfitting a bit at the same time:
In [44]:
clf.score(X_train_small, y_train_small)
Out[44]:
The text vectorizer has many parameters to customize its behavior, in particular how it extracts tokens:
In [45]:
TfidfVectorizer()
Out[45]:
In [46]:
print(TfidfVectorizer.__doc__)
The easiest way to introspect what the vectorizer is actually doing for a given set of parameters is to call vectorizer.build_analyzer() to get an instance of the text analyzer it uses to process the text:
In [47]:
analyzer = TfidfVectorizer().build_analyzer()
analyzer("I love scikit-learn: this is a cool Python lib!")
Out[47]:
You can notice that all the tokens are lowercase, that the single-letter word "I" was dropped, and that the hyphenated word was split into two tokens. Let's change some of that default behavior:
In [48]:
analyzer = TfidfVectorizer(
    preprocessor=lambda text: text,    # disable lowercasing
    token_pattern=ur'(?u)\b[\w-]+\b',  # treat hyphen as a letter
                                       # do not exclude single letter tokens
).build_analyzer()

analyzer("I love scikit-learn: this is a cool Python lib!")
Out[48]:
The analyzer name comes from the Lucene parlance: it wraps the sequential application of:

- the preprocessor, which transforms the document as a whole (e.g. lowercasing),
- the tokenizer, which splits the preprocessed text into individual tokens,
- token filtering and n-gram generation.

The analyzer system of scikit-learn is much more basic than Lucene's though.
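The first two stages can also be inspected individually with the corresponding build_* helpers (a quick illustration):
In [ ]:
# The preprocessor transforms the document as a whole,
# the tokenizer then splits the preprocessed text into tokens
preprocess = TfidfVectorizer().build_preprocessor()
tokenize = TfidfVectorizer().build_tokenizer()
tokenize(preprocess("I love scikit-learn!"))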
Exercise:

Hint: the TfidfVectorizer class can accept Python functions to customize the preprocessor, tokenizer or analyzer stages of the vectorizer; a sketch of a custom preprocessor is shown below. Furthermore:

- type TfidfVectorizer() alone in a cell to see the default values of the parameters,
- type print(TfidfVectorizer.__doc__) to print the documentation of the constructor parameters,
- use the ? suffix operator on any Python class or method to read the docstring, or even the ?? operator to read the source code.
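For instance, here is a minimal sketch of a custom preprocessor (the strip_headers function is hypothetical, written for illustration): it drops the metadata headers of each post before vectorization.
In [ ]:
def strip_headers(text):
    # In the 20 newsgroups files, the header block is separated
    # from the message body by the first blank line
    _, _, body = text.partition('\n\n')
    # a custom preprocessor replaces the default one, so lowercase explicitly
    return body.lower()

custom_vectorizer = TfidfVectorizer(preprocessor=strip_headers, min_df=2)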
In [ ]:
The MultinomialNB
class is a good baseline classifier for text as it's fast and has few parameters to tweak:
In [49]:
MultinomialNB()
Out[49]:
In [50]:
print(MultinomialNB.__doc__)
By reading the doc we can see that the alpha
parameter is a good candidate to adjust the model for the bias (underfitting) vs variance (overfitting) trade-off.
Exercise:

Use sklearn.grid_search.GridSearchCV or the model_selection.RandomizedGridSearch utility function from the previous chapters to find a good value for the parameter alpha.

Hint: RandomizedGridSearch also has a launch_for_arrays method as an alternative to launch_for_splits in case the CV splits have not been precomputed in advance.
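A possible sketch using GridSearchCV (the RandomizedGridSearch utility from the previous chapters is not redefined here):
In [ ]:
from sklearn.grid_search import GridSearchCV

# Search over a logarithmic grid of smoothing values for alpha
mnb_search = GridSearchCV(MultinomialNB(),
                          {'alpha': [0.0001, 0.001, 0.01, 0.1, 1.0]},
                          cv=3)
mnb_search.fit(X_train_small, y_train_small)
mnb_search.best_params_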
In [ ]:
The feature extraction class has many options to customize its behavior:
In [51]:
print(TfidfVectorizer.__doc__)
In order to evaluate the impact of the parameters of the feature extraction, one can chain a configured feature extraction and a linear classifier (as an alternative to the naive Bayes model):
In [58]:
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline
pipeline = Pipeline((
    ('vec', TfidfVectorizer(min_df=1, max_df=0.8, use_idf=True)),
    ('clf', PassiveAggressiveClassifier(C=1)),
))
Such a pipeline can then be cross validated or even grid searched:
In [59]:
from sklearn.cross_validation import cross_val_score
from scipy.stats import sem
scores = cross_val_score(pipeline, twenty_train_small.data,
                         twenty_train_small.target, cv=3, n_jobs=-1)
scores.mean(), sem(scores)
Out[59]:
For the grid search, the parameters names are prefixed with the name of the pipeline step using "__" as a separator:
In [60]:
from sklearn.grid_search import GridSearchCV
parameters = {
    #'vec__min_df': [1, 2],
    'vec__max_df': [0.8, 1.0],
    'vec__ngram_range': [(1, 1), (1, 2)],
    'vec__use_idf': [True, False],
}
gs = GridSearchCV(pipeline, parameters, verbose=2, refit=False)
_ = gs.fit(twenty_train_small.data, twenty_train_small.target)
In [61]:
gs.best_score_
In [62]:
gs.best_params_
Let's fit a model on the small dataset and collect info on the fitted components:
In [63]:
_ = pipeline.fit(twenty_train_small.data, twenty_train_small.target)
In [64]:
vec_name, vec = pipeline.steps[0]
clf_name, clf = pipeline.steps[1]
feature_names = vec.get_feature_names()
target_names = twenty_train_small.target_names
feature_weights = clf.coef_
feature_weights.shape
Out[64]:
By sorting the feature weights of the linear model and asking the vectorizer for the corresponding feature names, one can get a clue on what the model actually learned from the data:
In [65]:
def display_important_features(feature_names, target_names, weights, n_top=30):
    for i, target_name in enumerate(target_names):
        print("Class: " + target_name)
        print("")

        sorted_features_indices = weights[i].argsort()[::-1]

        most_important = sorted_features_indices[:n_top]
        print(", ".join("{0}: {1:.4f}".format(feature_names[j], weights[i, j])
                        for j in most_important))
        print("...")

        least_important = sorted_features_indices[-n_top:]
        print(", ".join("{0}: {1:.4f}".format(feature_names[j], weights[i, j])
                        for j in least_important))
        print("")

display_important_features(feature_names, target_names, feature_weights)
In [66]:
from sklearn.metrics import classification_report
predicted = pipeline.predict(twenty_test_small.data)
In [67]:
print(classification_report(twenty_test_small.target, predicted,
                            target_names=twenty_test_small.target_names))
The confusion matrix summarizes which classes get mistaken for which others: by looking at the off-diagonal entries we can see, for instance, that articles about atheism have been wrongly classified as being about religion 57 times:
In [68]:
from sklearn.metrics import confusion_matrix
confusion_matrix(twenty_test_small.target, predicted)
Out[68]:
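A graphical rendering often makes the confusion structure easier to read; here is a small follow-up sketch (not in the original notebook):
In [ ]:
cm = confusion_matrix(twenty_test_small.target, predicted)

# Render the counts as an image: off-diagonal cells are misclassifications
pl.matshow(cm)
pl.colorbar()
pl.xlabel('Predicted class')
pl.ylabel('True class')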
In [ ]: